This work aims to compare the red and white wine datasets. Both datasets are available among the dataset options for this project.
The main question that we will try to answer is:
This report explores a dataset of red and white wines from many perspectives. The red wine dataset has information about 1,599 wines and the white wine dataset has information about 4,898 wines. Together, the two datasets have 6,497 rows and 13 variables.
Red Wines:
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median : 2.200 Median :0.07900 Median :14.00 Median : 38.00
## Mean : 2.539 Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :15.500 Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
White Wines:
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median : 2.200 Median :0.07900 Median :14.00 Median : 38.00
## Mean : 2.539 Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :15.500 Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
Comparison of both classes of wine for each attribute:
Histograms of all variables in the dataset:
Red and white wines across all attributes, in median values:
Comparison of the best and worst red wines:
Comparison of the best and worst white wines:
The red wine dataset has information about 1,599 wines and the white wine dataset has information about 4,898 wines. Together, the two datasets have 6,497 rows and 13 variables.
The main feature in the dataset is quality. We would like to determine the best minimal combination of features for determining the quality of a wine.
Other features that would help our analysis for both wines: age of the wine, kind of grapes, price of the bottle, region of the wine, and whether it is a blend or not. For red wines there are visible differences when we compare high- and low-quality wines. It is possible to notice that alcohol, citric.acid and volatile.acidity are (apparently) inversely related. White wines, however, show a remarkable difference in the alcohol attribute and subtle differences in pH and density.
No. I created a new dataset by joining the red and white wine datasets.
It was necessary to adjust the datasets so that they could be used with the libraries that build the presented graphs.
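A minimal sketch of how the two datasets could be joined, assuming each one is already loaded as a data frame with identical columns. The toy data frames `red` and `white`, the helper `combine_wines`, and the added `type` column are illustrative assumptions, not the code used for this report.

```r
# Toy stand-ins for the red and white wine data frames (values are made up).
red   <- data.frame(alcohol = c(9.4, 10.2),
                    pH      = c(3.51, 3.31),
                    quality = c(5, 6))
white <- data.frame(alcohol = c(8.8, 11.0, 12.1),
                    pH      = c(3.00, 3.18, 3.20),
                    quality = c(6, 6, 7))

# Label each row with its wine type, then stack the two data frames.
combine_wines <- function(red, white) {
  red$type   <- "red"
  white$type <- "white"
  wines <- rbind(red, white)
  wines$type <- factor(wines$type, levels = c("red", "white"))
  wines
}

wines <- combine_wines(red, white)

# The combined row count is the sum of the two inputs
# (1,599 + 4,898 = 6,497 for the full datasets).
nrow(wines)

# Median of every numeric attribute per wine type.
medians <- aggregate(. ~ type, data = wines, FUN = median)
print(medians)
```

Keeping the wine type as a factor column makes it straightforward to group or colour by type in later plots and summaries.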
Some observations:
- Alcohol: In general, both wines (red and white) have a similar distribution of alcohol content, but red wines have more alcohol than white wines. Interestingly, we found white wines with 14% alcohol and red wines with 8% alcohol;
- pH: In general, red wines have a higher pH than white wines. Two considerations apply here: 1) pH is a logarithmic scale, so a difference of one unit on this scale represents a tenfold difference in acidity; 2) low pH values mean an acidic environment, while increasing pH values mean a more alkaline one. We can observe that pH and citric acid are inversely related, and this is confirmed in our dataset. White wines are more acidic and red wines are more alcoholic;
- Acidity:
  - Citric Acid: In general, white wines are more citric than red wines, which is natural given the grapes used in the process;
  - Volatile Acidity: In general, red wines have more volatile acidity than white wines;
  - Fixed Acidity: In general, red wines have more fixed acidity than white wines;
- Chlorides: In general, red wines have more chlorides than white wines, probably because of the physical-chemical production process. Both wines show a lot of variability in this attribute;
- Density: In general, red wines are denser than white wines. Density is an important factor when pairing wine with fat, which is why red wine is commonly served with fatty meats. This is an expected result. White wines are meant to be refreshing, and high density is not desirable for that purpose;
- Sulphates: In general, red wines have more sulphates than white wines, but both wines show low variability in this variable;
- Sulfur:
  - Total and Free Sulfur Dioxide: Sulfur dioxide is used to prevent oxidation and microbial growth. However, excessive amounts of SO2 can inhibit fermentation and cause undesirable sensory effects;
- Residual Sugar: In general, red wines have almost no residual sugar. White wines have more variability and more residual sugar than red wines. The distribution of this variable appears to be skewed;
- Quality: Even with different combinations of attributes, both wines arrive at similar quality.
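The pH remark in the observations above can be made concrete with a short calculation. Because pH is the negative base-10 logarithm of hydrogen ion activity, a pH difference of ΔpH corresponds to a factor of 10^ΔpH in acidity. The function name `acidity_ratio` and the example pH values are illustrative assumptions.

```r
# Ratio of hydrogen ion activity between a more acidic sample (lower pH)
# and a less acidic one (higher pH): 10^(ph_high - ph_low).
acidity_ratio <- function(ph_low, ph_high) {
  10^(ph_high - ph_low)
}

acidity_ratio(3.0, 4.0)  # one full pH unit: a factor of 10
acidity_ratio(3.2, 3.3)  # 0.1 pH unit: a factor of about 1.26
```

So even a gap of a tenth of a pH unit between two wines means the more acidic one has roughly a quarter more hydrogen ion activity.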
Conclusions:
This graph shows us many interesting things:
All Wines:
Red Wines: We are interested in understanding the behavior of quality over other variables considering just red wines.
White Wines:
Now we are interested in understanding the behavior of quality over other variables considering just white wines.
All Wines: The differences between the best and worst wines are subtle for both types of wine (red and white). The data suggest that more alcohol and citric acid, together with lower density and chlorides, are associated with maximum quality in both types of wine. This evidence agrees with oenology theory, in which good wines have a good balance between alcohol, density and citric acid. Chlorides and sulphates may be substances added during the process to balance the wine.
Red Wines: If we look only at red wines, maximum quality is obtained when:
The best red wines have more alcohol, more citric acid, more sulphates, lower pH, and lower density and chlorides. These attributes show the contrast between the best and worst red wines.
White Wines: If we look only at white wines, maximum quality is obtained when:
The best white wines have more alcohol, more citric acid, more sulphates, and lower density and chlorides than the worst ones. However, they have a higher pH, whereas the best red wines have a lower pH than the worst red wines.
General Conclusions:
- For both types of wine, the variables most related to quality are alcohol, density and citric acid. This evidence agrees with oenology theory, in which good wines have a good balance between alcohol, density and citric acid;
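The best-versus-worst comparisons above can be sketched as a median comparison between the wines at the lowest and highest quality scores. This is only an illustration on synthetic values; the helper `compare_extremes` and the example data frame are assumptions, not the code or data behind this report.

```r
# Synthetic example data (NOT the wine dataset).
wines <- data.frame(
  quality = c(3, 3, 5, 6, 8, 8),
  alcohol = c(9.0, 9.4, 10.0, 10.5, 12.0, 12.6),
  density = c(0.998, 0.997, 0.996, 0.996, 0.994, 0.993)
)

# Median of each attribute for the lowest- and highest-quality wines,
# plus the difference (best minus worst).
compare_extremes <- function(df, quality_col = "quality") {
  q     <- df[[quality_col]]
  attrs <- setdiff(names(df), quality_col)
  worst <- sapply(df[q == min(q), attrs, drop = FALSE], median)
  best  <- sapply(df[q == max(q), attrs, drop = FALSE], median)
  data.frame(attribute = attrs, worst = worst, best = best,
             difference = best - worst, row.names = NULL)
}

result <- compare_extremes(wines)
print(result)
```

A positive difference means the attribute is higher in the best wines, and a negative one means it is lower. Applying the same function to the red and white subsets separately would give the per-type contrasts.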
Outlier Analysis: There is a white wine with maximum quality and a low percentage of alcohol. This is an interesting outlier to analyze. In this particular case, the low percentage of alcohol is associated with higher values of residual sugar, fixed acidity and density, perhaps giving this specimen the balance it needs.
From the bivariate analysis it is possible to see the relation of alcohol, density and acidity with quality. The hypothesis was that quality results from a balance between alcohol, density and acidity. If true, the other variables are a byproduct of winemaking. Therefore, we built a model to measure quality as a balance between alcohol, density and acidity. This is obviously a simplification of reality. It would leave no room for the creativity of escaping the obvious and achieving excellence by purposefully highlighting some element, and it would classify as good only those specimens that fit a fixed proportion. Even so, we believe that for most wines this simplification of reality would be enough. After all, not every wine is exotic.
Yes. Two interesting facts can be described. 1) Alcohol is the variable most related to quality, and this holds for all wines (red and white). 2) Alcohol by itself has a correlation of about 0.45 with quality.
At this point we can see that alcohol and density are related to quality for all wines, but acidity is not. We also tried to relate the acidity variables to alcohol and density, but without success. After that, we tried to build a function that relates alcohol and density to explain quality. We built three models to explain quality:
\[ f(a,d) = \sqrt{a \cdot d} \]
\[ f(a,d,c) = \sqrt[3]{a \cdot d \cdot c} \]
\[ f(a,d) = \frac{a}{d} \]
The first model treats quality as a balance between alcohol and density. The second model treats quality as a balance between alcohol, density and acidity (which is closer to reality). The third model treats quality as a proportion between alcohol and density. We checked the correlation of each model's output with the quality values to measure how well each model explains the quality variable. Alcohol by itself reaches a correlation of about 0.45 with quality. When we combine alcohol with the other variables that were also related to quality in the bivariate analysis, the correlation between the model and the quality variable does not improve.
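A minimal sketch of how the three models could be scored against quality with Pearson correlation. The function names, the use of citric acid as the acidity term `c`, and the synthetic data are assumptions made for illustration. The data are randomly generated, so the printed correlations do not reproduce the figures in this report.

```r
# Candidate models from the text: a = alcohol, d = density, c = acidity term.
model_geo2  <- function(a, d)    sqrt(a * d)          # geometric mean of a and d
model_geo3  <- function(a, d, c) (a * d * c)^(1 / 3)  # geometric mean of a, d, c
model_ratio <- function(a, d)    a / d                # proportion of a to d

# Synthetic data for illustration only; NOT the wine dataset.
set.seed(42)
n <- 200
alcohol <- runif(n, 8, 14)
density <- 1.005 - 0.001 * alcohol + rnorm(n, sd = 0.001)
citric  <- runif(n, 0.1, 0.8)  # assumed stand-in for the acidity term
quality <- round(3 + 0.3 * (alcohol - 8) + rnorm(n, sd = 1))

# Pearson correlation of each score with quality.
correlations <- c(
  alcohol = cor(alcohol, quality),
  geo2    = cor(model_geo2(alcohol, density), quality),
  geo3    = cor(model_geo3(alcohol, density, citric), quality),
  ratio   = cor(model_ratio(alcohol, density), quality)
)
print(round(correlations, 3))

# In a simple linear fit, a correlation r corresponds to r^2 of the
# variance explained, e.g. r = 0.45 gives r^2 of about 0.20.
0.45^2
```

One possible reason the combined scores add little is that density varies over a very narrow range (roughly 0.99 to 1.00 in the summaries above), so `sqrt(a * d)` and `a / d` are close to functions of alcohol alone.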
This work aims to understand quality in the red and white wine datasets. The report explores the data from many perspectives. The red wine dataset has information about 1,599 wines and the white wine dataset has information about 4,898 wines; together they have 6,497 rows and 13 variables. We joined both datasets and worked to understand quality with a single model for both types of wine. We built three models to derive quality from alcohol, density and acidity.
We observed that alcohol alone has a correlation of about 0.45 with the quality variable for both wines (red and white). We combined alcohol with other variables to express quality as a proportion. Combining alcohol with other variables that correlate well with quality did not improve this result.
At this point we raise a new hypothesis: we may need more information about the wines, such as the year of production, the grapes used, the place of production (terroir) and other variables related to taste.